Words and Word Usage: Newspaper Text versus the Web
Authors
Abstract
This paper explores the differences in words and word usage between two corpora, one derived from newspaper text and the other from the web. A corpus of web pages is compiled from a controlled traversal of the web, producing a topic-diverse collection of 2 billion words of web text. We compare this Web Corpus with the Gigaword Corpus, a 2 billion word collection of news articles. The Web Corpus is applied to the task of automatic thesaurus extraction, obtaining overall results similar to those from the Gigaword. The quality of the synonyms extracted for each target word depends on the word's usage in the corpus. With many more words available on the web, a much larger Web Corpus can be created to obtain better results in different NLP tasks.
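Automatic thesaurus extraction of the kind the abstract describes is typically done by comparing the distributional contexts of words: words that occur in similar contexts are ranked as synonym candidates. The sketch below illustrates this idea on a toy corpus; it is a minimal illustration of the general technique, not the paper's actual pipeline, and all function names are our own.

```python
from collections import Counter, defaultdict
import math

# Illustrative sketch of distributional thesaurus extraction: build a
# context vector for each word, then rank other words by the cosine
# similarity of their context vectors. Not the paper's exact method.

def context_vectors(sentences, window=2):
    """Map each word to a Counter of co-occurring words within the window."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def nearest(target, vectors, n=3):
    """Rank all other words by context similarity to the target."""
    scores = [(w, cosine(vectors[target], v))
              for w, v in vectors.items() if w != target]
    return sorted(scores, key=lambda x: -x[1])[:n]

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat chased a mouse".split(),
    "a dog chased a ball".split(),
]
vecs = context_vectors(corpus)
print(nearest("cat", vecs))  # "dog" ranks first: it shares cat's contexts
```

On this toy corpus "dog" is the top candidate for "cat" because the two words occur in identical contexts; on real corpora the same idea is applied with weighted features and larger windows.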
Similar resources
A Method for Extracting Keywords and Weighting Words to Improve Persian Text Classification
Due to ever-increasing information expansion and the huge amount of existing unstructured documents, keywords play a very important role in information retrieval. Because manual extraction of keywords faces various challenges, their automated extraction seems inevitable. In this research, we have tried to use a thesaurus (a structured word-net) to extract them automatically. A...
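A common baseline for the automatic keyword extraction this snippet describes is TF-IDF weighting: terms frequent in one document but rare across the collection are ranked as keywords. The sketch below shows that baseline; it is not the thesaurus-based method of the abstract, and all names in it are illustrative.

```python
import math
from collections import Counter

# Minimal TF-IDF keyword extraction: score each term in one document by
# term frequency times inverse document frequency, and keep the top n.
# A standard baseline, not the thesaurus-based method described above.

def tfidf_keywords(docs, doc_index, n=2):
    """Return the top-n TF-IDF-weighted terms of docs[doc_index]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count documents, not tokens
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        term: (count / total) * math.log(len(docs) / doc_freq[term])
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:n]]

docs = [
    "web corpus text corpus".split(),
    "news text archive".split(),
    "fuzzy summarization summarization".split(),
]
print(tfidf_keywords(docs, 0))  # ['corpus', 'web']
```

Here "corpus" outranks "text" even though both appear in document 0, because "text" also occurs in another document and so carries a lower inverse document frequency.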
Language Models
A language model assigns a probability to a piece of unseen text, based on some training data. For example, a language model based on a big English newspaper archive is expected to assign a higher probability to “a bit of text” than to “aw pit tov tags”, because the words in the former phrase (or word pairs or word triples if so-called N-gram models are used) occur more frequently in the data ...
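The contrast in the snippet can be reproduced with a tiny bigram model: sentences built from word pairs seen in training get high probability, while unseen pairs get zero. This is a minimal maximum-likelihood sketch on a toy corpus; real models add smoothing so unseen N-grams do not zero out the whole sentence.

```python
from collections import Counter

# A minimal bigram language model: P(sentence) is the product of
# P(word | previous word) estimated from raw counts (no smoothing).

def train_bigrams(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()    # <s> marks the sentence start
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams):
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:            # unseen history: probability 0
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

corpus = ["a bit of text", "a bit of luck", "a lot of text"]
uni, bi = train_bigrams(corpus)
print(sentence_prob("a bit of text", uni, bi))    # frequent pairs: high
print(sentence_prob("aw pit tov tags", uni, bi))  # unseen pairs: 0.0
```

With this training data "a bit of text" scores 3/3 · 2/3 · 2/2 · 2/3 = 4/9, while "aw pit tov tags" contains no attested bigram and scores zero, mirroring the newspaper-archive example in the snippet.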
Extraction-Based Text Summarization Using Fuzzy Analysis
Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the textual information extracted from the docu...
Construction and Analysis of a Large Vietnamese Text Corpus
This paper presents a new Vietnamese text corpus which contains around 4.05 billion words. It is a collection of Wikipedia texts, newspaper articles, and random web texts. The paper describes the process of collecting, cleaning, and creating the corpus. Processing Vietnamese texts posed several challenges; for example, unlike many Latin-script languages, Vietnamese does not use blanks f...